Day 2 - Practical 1 - basic metabolomics data analysis

2024 NUTRIM microbiome & metabolome workshop

Author

Michal Skawinski

Published

June 28, 2024

1 Intro

TODO (study doesnt exist) Today we’re going to look at the metabolomics data, from the same Papa et al. 2012 study on paediatric IBD patients and controls. Note that those data corresponds to the same samples we used yesterday.

The data we will use today come from Liquid Chromatography-Mass Spectrometry (LC-MS) profiling of feces’ metabolites and has already been preprocessed to a table with metabolites’ relative concentrations for each sample.

Raw data and chromatograms

Usually, for the detection of metabolites we use mass spectrometry, coupled to other technologies (like liquid or gas chromatography), depending on the sample characteristics and study aims. The raw data consists of chromatograms showing peaks of intensity of detected metabolites against their mass-to-charge ratios (see figure below:)

How to read a chromatogram?

Common preprocessing steps

In short, data processing perform peaks detection, retention time alignment and integration of peak areas. Additionall steps include removing of zero’s and imputation of missing values. However, today we will only focus on the important concepts of the preprocessed metabolomics data analysis.

PS. The detailed data generation protocols can be found here.

In this part, we’re going to foucus…

In the second part we will focus on more advanced data exploration methods for metabolomics.

2 Learning goals 💪

TODO - Practice general R skills to inspect … -

3 Load R packages and some functions 📦

library(readxl)
library(here)
library(tidyverse)
library(MetabolAnalyze) # contains pareto-scaling
# We also load an R script, with some useful functions we're going to use today.
source("functions.R") 

4 Read and inspect data 🔍

Read the metabolites’ relative concentrations table, stored in a a .csv file.

metabolites <- read.csv(file = here("data/papa2012/processed/metabolites_short.csv"), row.names = 1)
metabolites[1:10, 1:10]
               V1     V2     V3      V4       V5      V6      V7      V8     V9
CSM5FZ3T   790031 124175  95316  818174 73935019  331452  662877 2229497  44391
CSM5FZ44  1106701 322143  51995  667889 56592824  622792  281607 4308157  39960
CSM5FZ48 14743075  14974  20352  582356 44550963 1199342 3537295 3202186 743898
CSM5FZ4A  5277616  25217  66435  122080  9161552  733011  230294 1022091 147853
CSM5FZ4K  9419959  15874  13002   87326  6254831  135938   83969 1813658  78266
CSM5FZ4O   341092  30670  29092  216458 14236015  429151  318548 2351268 140408
CSM5MCTZ 11487127 229118 121572 1059328 72279000  117107  290841  674867  21303
CSM5MCUM 48781145 134122 148054  536873 41369143  619150  673328 3993021 199937
CSM5MCUU  4460140 180855 124893  149804 11058836  624780  543346 1660035 274899
CSM5MCVN   585886 262394 138361  721657 60910005  167582  219415 2598553   3369
              V10
CSM5FZ3T 96982956
CSM5FZ44 84201363
CSM5FZ48 90547462
CSM5FZ4A  4834431
CSM5FZ4K  8305571
CSM5FZ4O 11587139
CSM5MCTZ 55783394
CSM5MCUM 33364290
CSM5MCUU  9015164
CSM5MCVN 88296989

Read the metabolites metadata file, stored in a .cvs file.

metadata_metabolites <- read.csv(file = here("data/papa2012/processed/metadata_metabolites.csv"), row.names = 1)
head(metadata_metabolites)
  SampleID ParticipantID Age Diagnosis    Sex
1 CSM5FZ3T         C3002  76        CD Female
2 CSM5FZ44         C3002  76        CD Female
3 CSM5FZ48         C3003  43        UC Female
4 CSM5FZ4A         C3004  47        UC Female
5 CSM5FZ4K         C3003  43        UC Female
6 CSM5FZ4O         C3005  76        UC Female

Metabolites

  • Samples are stored in rows, e.g. CSM5FZ3T, CSM5FZ44
  • Metabolites are stored in columns, in the form of relative concentrations
  • Note, that the metabolites don’t have identifiers. It’s because we have not identified the specific metabolites yet and those variables represent a detected ion with unique properties, based on which we can later try to identify the metabolite.
  • Additionally, to reduce the computation time we have chosen only a subset of the original metabolites and samples.

Metadata

  • Information on sample IDs, participant IDs, age, diagnosis and sex.
  • This makes it possible to compare groups and explore the differences and similarities between them.

Now let’s inspect those data more.

How many metabolites are there?

ncol(metabolites)
[1] 423

Are there any missing data?

any(is.na(metabolites))
[1] FALSE

How many zero’s do we have per metabolite?

colSums(metabolites == 0)
  V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12  V13  V14  V15  V16 
   0    0   11    7    1   11    0    0    1    2    1    6    2    0    7    0 
 V17  V18  V19  V20  V21  V22  V23  V24  V25  V26  V27  V28  V29  V30  V31  V32 
   0    2    0    0    0    0    8   10    1    1    3    6    1    1    0    0 
 V33  V34  V35  V36  V37  V38  V39  V40  V41  V42  V43  V44  V45  V46  V47  V48 
   2   10    0    1    6    1    8    2    2    1    7    8    9    0   12    0 
 V49  V50  V51  V52  V53  V54  V55  V56  V57  V58  V59  V60  V61  V62  V63  V64 
   2   12    3    2    3   12    7    3    0    6    2   11   10    8    3    0 
 V65  V66  V67  V68  V69  V70  V71  V72  V73  V74  V75  V76  V77  V78  V79  V80 
   0    9    1   10    5    0    1    0    7    0   11    0    4    0    5    2 
 V81  V82  V83  V84  V85  V86  V87  V88  V89  V90  V91  V92  V93  V94  V95  V96 
   9    8    2    0    3    5    0    0    3    0    0    6    0    8    2    6 
 V97  V98  V99 V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 
   2    1    1    1    0    7   10   12    2    8   10    8    4    8    2    2 
V113 V114 V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 V127 V128 
  11    4   11    0    8    2    5    4    4    4    6    0    0    1    0    0 
V129 V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 V143 V144 
   1    0    0    4    0    0    0    0    0    6    2    0    1   12    0    5 
V145 V146 V147 V148 V149 V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 V160 
   0    6    0    0    0    0    0    6    1    4    0   10    0    3    0    0 
V161 V162 V163 V164 V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176 
   0    0    0    0    6    0    0   11    0   10    1    5    6    0    0    5 
V177 V178 V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192 
   4    7   10    8    3    7    6    0   11    0   10    0    4    6    1    2 
V193 V194 V195 V196 V197 V198 V199 V200 V201 V202 V203 V204 V205 V206 V207 V208 
  11    2    1    5   10    3    0    5    2    2    6    7    0    0    5    0 
V209 V210 V211 V212 V213 V214 V215 V216 V217 V218 V219 V220 V221 V222 V223 V224 
   5    7    4   12    0    5    0    6    8    2    5    3    5    0    2    4 
V225 V226 V227 V228 V229 V230 V231 V232 V233 V234 V235 V236 V237 V238 V239 V240 
   0    0    2    1   10    8    3    0    0    6    8    3    5    1   11    1 
V241 V242 V243 V244 V245 V246 V247 V248 V249 V250 V251 V252 V253 V254 V255 V256 
   3    0    2   11    8    2    9    1    4    9   10    0    2    6    5    4 
V257 V258 V259 V260 V261 V262 V263 V264 V265 V266 V267 V268 V269 V270 V271 V272 
   3    4    1    2   11   10    0    0    0    8    1    0    0    3    0    0 
V273 V274 V275 V276 V277 V278 V279 V280 V281 V282 V283 V284 V285 V286 V287 V288 
   0   10    0    4    6    5    8    0    0   12    9    0   11    2    3    7 
V289 V290 V291 V292 V293 V294 V295 V296 V297 V298 V299 V300 V301 V302 V303 V304 
   2    2    6    5   10    8    0    0    2   10   11   12    0    3   10    7 
V305 V306 V307 V308 V309 V310 V311 V312 V313 V314 V315 V316 V317 V318 V319 V320 
   6    1    3    7    1    4    7    1    1    4    2   10    6    3    4    0 
V321 V322 V323 V324 V325 V326 V327 V328 V329 V330 V331 V332 V333 V334 V335 V336 
   0    1    7    0    9    0   12   12    2    0    5    1    2    0   11    0 
V337 V338 V339 V340 V341 V342 V343 V344 V345 V346 V347 V348 V349 V350 V351 V352 
   0    0    0    3    1    1    0    1   10    0    7    4    0    0    0    5 
V353 V354 V355 V356 V357 V358 V359 V360 V361 V362 V363 V364 V365 V366 V367 V368 
   4   11    6    1    0    1    0    0    4    1    2    1    2    7    5    5 
V369 V370 V371 V372 V373 V374 V375 V376 V377 V378 V379 V380 V381 V382 V383 V384 
  12    8    9    9    4    1    0    3    8   12   12    1    3    4   11    0 
V385 V386 V387 V388 V389 V390 V391 V392 V393 V394 V395 V396 V397 V398 V399 V400 
   0    9    0    2    8    2    3    5    1    0    5    7   12    6    5   10 
V401 V402 V403 V404 V405 V406 V407 V408 V409 V410 V411 V412 V413 V414 V415 V416 
   3   12    7    7    1    2    6    0    3    9    5    6   10    0    5    9 
V417 V418 V419 V420 V421 V422 V423 
  12    3   12    6    2    0    7 

Inspect the structure of the data. Tip: use view() or structure()

# write your code below

How many samples with different Diagnosis do we have? Tip: use table()

# write your code below
structure(metadata_metabolites)
    SampleID ParticipantID Age Diagnosis    Sex
1   CSM5FZ3T         C3002  76        CD Female
2   CSM5FZ44         C3002  76        CD Female
3   CSM5FZ48         C3003  43        UC Female
4   CSM5FZ4A         C3004  47        UC Female
5   CSM5FZ4K         C3003  43        UC Female
6   CSM5FZ4O         C3005  76        UC Female
7   CSM5MCTZ         C3006  32        UC   Male
8   CSM5MCUM         C3006  32        UC   Male
9   CSM5MCUU         C3005  76        UC Female
10  CSM5MCVN         C3002  76        CD Female
11  CSM5MCX3         C3006  32        UC   Male
12  CSM5MCXF         C3012  37        CD Female
13  CSM5MCY8         C3005  76        UC Female
14  CSM5MCZF         C3012  37        CD Female
15  CSM67U9T         C3015  50        UC Female
16  CSM67U9V         C3016  32        CD Female
17  CSM67U9X         C3017  45        CD   Male
18  CSM67UBH         C3002  76        CD Female
19  CSM67UBN         C3002  76        CD Female
20  CSM67UBZ         C3003  43        UC Female
21  CSM67UDN         C3004  47        UC Female
22  CSM67UEA         C3005  76        UC Female
23  CSM67UEI         C3005  76        UC Female
24  CSM67UFZ         C3006  32        UC   Male
25  CSM67UGO         C3012  37        CD Female
26  CSM67UH7         C3022  69    nonIBD   Male
27  CSM79HIT         C3016  32        CD Female
28  CSM79HJ6         C3017  45        CD   Male
29  CSM79HK1         C3002  76        CD Female
30  CSM79HLM         C3003  43        UC Female
31  CSM79HMN         C3006  32        UC   Male
32  CSM79HOJ         C3012  37        CD Female
33  CSM79HQN         C3035  62        CD   Male
34  CSM79HRG         C3031  NA        CD Female
35  CSM7KON2         C3012  37        CD Female
36  CSM7KON8         C3012  37        CD Female
37  CSM7KOO9         C3031  NA        CD Female
38  CSM7KOOF         C3031  NA        CD Female
39  CSM7KOP8         C3032  42        UC Female
40  CSM7KOPC         C3032  42        UC Female
41  CSM7KOPE         C3032  42        UC Female
42  CSM7KOQ7         C3037  46        UC Female
43  CSM7KOQZ         C3017  45        CD   Male
44  CSM7KORU         C3034  36        UC Female
45  CSM7KOTU         C3035  62        CD   Male
46  CSM7KOU9         C3032  42        UC Female
47  CSM7KOUF         C3032  42        UC Female
48  CSM7KOUL         C3037  46        UC Female
49  CSM7KOUR         C3037  46        UC Female
50  CSM9X1XW         C3034  36        UC Female
51  CSM9X1YV         C3035  62        CD   Male
52  CSM9X1ZC         C3031  NA        CD Female
53  CSM9X211         C3037  46        UC Female
54  CSM9X213         C3037  46        UC Female
55  CSM9X21J         C3034  36        UC Female
56  CSM9X21P         C3034  36        UC Female
57  CSM9X22G         C3032  42        UC Female
58  CSM9X22Q         C3032  42        UC Female
59  CSM9X22U         C3037  46        UC Female
60  CSM9X237         C3035  62        CD   Male
61  CSM9X23B         C3035  62        CD   Male
62  CSM9X23L         C3037  46        UC Female
63  CSM9X23P         C3037  46        UC Female
64  ESM5GEZ6         E5004   7        UC Female
65  ESM5MEBU         E5004   7        UC Female
66  ESM5MEC5         E5004   7        UC Female
67  ESM718SY         E5004   7        UC Female
68  HSM5FZBJ         H4001  14        CD   Male
69  HSM5MD3L         H4014  10        CD Female
70  HSM5MD41         H4010  13        UC   Male
71  HSM5MD43         H4010  13        UC   Male
72  HSM5MD4O         H4007  15        CD Female
73  HSM5MD4Q         H4007  15        CD Female
74  HSM5MD5D         H4008  13    nonIBD Female
75  HSM5MD6A         H4014  10        CD Female
76  HSM5MD6K         H4013   8    nonIBD   Male
77  HSM5MD82         H4009   6    nonIBD Female
78  HSM5MD87         H4010  13        UC   Male
79  HSM5MD8B         H4009   6    nonIBD Female
80  HSM67VFZ         H4008  13    nonIBD Female
81  HSM67VG6         H4008  13    nonIBD Female
82  HSM67VGA         H4009   6    nonIBD Female
83  HSM67VGG         H4009   6    nonIBD Female
84  HSM67VH1         H4010  13        UC   Male
85  HSM67VI5         H4024  11    nonIBD   Male
86  HSM6XRQE         H4019  11        UC Female
87  HSM6XRRD         H4019  11        UC Female
88  HSM6XRRV         H4014  10        CD Female
89  HSM6XRS8         H4015  15        CD   Male
90  HSM6XRSN         H4007  15        CD Female
91  HSM6XRSZ         H4008  13    nonIBD Female
92  HSM6XRTC         H4009   6    nonIBD Female
93  HSM6XRTM         H4010  13        UC   Male
94  HSM6XRTQ         H4010  13        UC   Male
95  HSM6XRV2         H4019  11        UC Female
96  HSM6XRVC         H4013   8    nonIBD   Male
97  HSM6XRVK         H4013   8    nonIBD   Male
98  HSM6XRVO         H4014  10        CD Female
99  HSM6XRVW         H4014  10        CD Female
100 HSM7CYYZ         H4009   6    nonIBD Female
101 HSM7CYZJ         H4022   9    nonIBD Female
102 HSM7CZ1T         H4027  15        UC   Male
103 HSM7CZ2P         H4008  13    nonIBD Female
104 HSM7J4GR         H4027  15        UC   Male
105 HSM7J4HO         H4035  16        UC   Male
106 HSM7J4HS         H4035  16        UC   Male
107 HSM7J4I7         H4024  11    nonIBD   Male
108 HSM7J4JT         H4040  17        UC Female
109 HSM7J4JZ         H4035  16        UC   Male
110 HSM7J4K8         H4035  16        UC   Male
111 HSM7J4KK         H4023  16    nonIBD   Male
112 HSM7J4M4         H4019  11        UC Female
113 HSM7J4ME         H4019  11        UC Female
114 HSM7J4NE         H4027  15        UC   Male
115 HSM7J4O1         H4024  11    nonIBD   Male
116 HSM7J4O3         H4024  11    nonIBD   Male
117 HSM7J4OE         H4031  12        CD   Male
118 HSM7J4PE         H4019  11        UC Female
119 HSM7J4QT         H4014  10        CD Female
120 HSM7J4R2         H4013   8    nonIBD   Male
121 HSMA33IY         H4042  15        UC   Male
122 HSMA33J5         H4045  14    nonIBD Female
123 HSMA33JH         H4031  12        CD   Male
124 HSMA33M8         H4040  17        UC Female
125 HSMA33O1         H4027  15        UC   Male
126 HSMA33O5         H4027  15        UC   Male
127 HSMA33OR         H4035  16        UC   Male
128 HSMA33QO         H4045  14    nonIBD Female
129 HSMA33SG         H4045  14    nonIBD Female
130 HSMA33SK         H4045  14    nonIBD Female
131 MSM5LLDA         M2008  30        CD Female
132 MSM5LLDS         M2008  30        CD Female
133 MSM5LLGH         M2014  30        CD   Male
134 MSM5LLGR         M2028  24        CD Female
135 MSM5LLHQ         M2008  30        CD Female
136 MSM5LLHZ         M2008  30        CD Female
137 MSM5LLIC         M2025  43        CD Female
138 MSM5LLIG         M2021  26        CD   Male
139 MSM5LLIK         M2021  26        CD   Male
140 MSM6J2H9         M2039  40    nonIBD Female
141 MSM6J2IQ         M2014  30        CD   Male
142 MSM6J2J1         M2014  30        CD   Male
143 MSM6J2LB         M2008  30        CD Female
144 MSM6J2PE         M2025  43        CD Female
145 MSM6J2PO         M2047  57    nonIBD   Male
146 MSM6J2PU         M2047  57    nonIBD   Male
147 MSM6J2Q3         M2028  24        CD Female
148 MSM6J2QZ         M2014  30        CD   Male
149 MSM79H7G         M2071  26        UC Female
150 MSM79H7M         M2071  26        UC Female
151 MSM79H7Q         M2047  57    nonIBD   Male
152 MSM79H7Y         M2047  57    nonIBD   Male
153 MSM79H9Q         M2061  56    nonIBD   Male
154 MSM79HAH         M2039  40    nonIBD Female
155 MSM79HAJ         M2064  74        UC   Male
156 MSM79HAN         M2064  74        UC   Male
157 MSM79HC8         M2047  57    nonIBD   Male
158 MSM79HD6         M2071  26        UC Female
159 MSM79HDA         M2069  29        UC Female
160 MSM79HDE         M2069  29        UC Female
161 MSM79HDQ         M2068  19        CD Female
162 MSM9VZEK         M2068  19        CD Female
163 MSM9VZES         M2068  19        CD Female
164 MSM9VZEW         M2069  29        UC Female
165 MSM9VZFR         M2079  29    nonIBD   Male
166 MSM9VZGO         M2061  56    nonIBD   Male
167 MSM9VZHF         M2061  56    nonIBD   Male
168 MSM9VZHR         M2072  51    nonIBD   Male
169 MSM9VZIM         M2083  25        UC   Male
170 MSM9VZIQ         M2083  25        UC   Male
171 MSM9VZJZ         M2075  61    nonIBD   Male
172 MSM9VZKE         M2079  29    nonIBD   Male
173 MSM9VZL9         M2084  23    nonIBD Female
174 MSM9VZLJ         M2061  56    nonIBD   Male
175 MSM9VZLP         M2064  74        UC   Male
176 MSM9VZLV         M2064  74        UC   Male
177 MSM9VZNL         M2047  57    nonIBD   Male
178 MSM9VZNX         M2083  25        UC   Male
179 MSM9VZOU         M2069  29        UC Female
180 MSM9VZP3         M2069  29        UC Female
181 MSM9VZPL         M2085  23        CD   Male
182 MSM9VZPV         M2060  62    nonIBD Female
183 MSMA267V         M2103  35        UC Female
184 MSMA26AL         M2068  19        CD Female
185 MSMA26BB         M2084  23    nonIBD Female
186 MSMA26BR         M2079  29    nonIBD   Male
187 MSMA26EJ         M2069  29        UC Female
188 MSMAPC5D         M2083  25        UC   Male
189 MSMAPC5F         M2097  21    nonIBD   Male
190 MSMAPC62         M2103  35        UC Female
191 MSMAPC6G         M2085  23        CD   Male
192 MSMAPC6K         M2084  23    nonIBD Female
193 MSMAPC7J         M2075  61    nonIBD   Male
194 MSMAPC7T         M2077  32    nonIBD Female
195 MSMB4LXW         M2097  21    nonIBD   Male
196 MSMB4LYB         M2085  23        CD   Male
197 MSMB4LYH         M2103  35        UC Female
198 MSMB4LZ4         M2071  26        UC Female
199 MSMB4LZ8         M2079  29    nonIBD   Male
200 MSMB4LZC         M2077  32    nonIBD Female
201 MSMB4LZR         M2083  25        UC   Male
202 MSMB4LZX         M2084  23    nonIBD Female
203 PSM6XBQY         P6005  11        CD   Male
204 PSM6XBS6         P6009  16        CD   Male
205 PSM6XBSI         P6010  10        CD   Male
206 PSM6XBSS         P6005  11        CD   Male
207 PSM6XBTF         P6009  16        CD   Male
208 PSM6XBUG         P6010  10        CD   Male
209 PSM6XBUO         P6010  10        CD   Male
210 PSM7J12V         P6025  17        UC   Male
211 PSM7J134         P6025  17        UC   Male
212 PSM7J136         P6005  11        CD   Male
213 PSM7J13E         P6005  11        CD   Male
214 PSM7J141         P6009  16        CD   Male
215 PSM7J14L         P6010  10        CD   Male
216 PSM7J158         P6016  16        CD   Male
217 PSM7J15U         P6018  17    nonIBD Female
218 PSM7J161         P6028   9        CD   Male
219 PSM7J167         P6028   9        CD   Male
220 PSM7J177         P6028   9        CD   Male
221 PSM7J17F         P6028   9        CD   Male
222 PSM7J17L         P6016  16        CD   Male
223 PSM7J188         P6033  15        CD   Male
224 PSM7J193         P6013   6        UC   Male
225 PSM7J19J         P6016  16        CD   Male
226 PSM7J1CC         P6005  11        CD   Male
227 PSM7J1CK         P6009  16        CD   Male
228 PSM7J1CU         P6009  16        CD   Male
229 PSMA264Q         P6035  16        UC   Male
230 PSMA265N         P6016  16        CD   Male
231 PSMA265T         P6016  16        CD   Male
232 PSMA265X         P6038  16        UC Female
233 PSMA266C         P6028   9        CD   Male
234 PSMA2675         P6038  16        UC Female
235 PSMA267J         P6035  16        UC   Male
236 PSMA267P         P6035  16        UC   Male
237 PSMA269E         P6028   9        CD   Male
238 PSMA269W         P6038  16        UC Female
239 PSMA26A3         P6038  16        UC Female
240 PSMB4MBK         P6035  16        UC   Male
241 PSMB4MC5         P6038  16        UC Female

The metadata is the same as for the microbiome ! 👀

table(metadata_metabolites$Diagnosis)

    CD nonIBD     UC 
    87     55     99 

5 Data preatreatment 🔍

Before we can start using some more fancy unsupervised or supervised techniques, we need to understand the importance of data preatreatment, that includes:

  1. normalization,

  2. transformation, and

  3. scaling.

We will explore their impact and importance on the data distribution as well as common methods in metabolomics.

5.1 Normalization:

TODO explain here ..

TODO explain pqn here

The structure of the function we have provided is as follows:

pqn(data_matrix)

The function accepts a data_matrix and return a pqn normalized matrix, with the same structure (metabolites in columns and samples in rows).

TASK 1

Normalize the data using pqn normalization and assign it to a new variable, e.g. metabolites_pqn.

metabolites_pqn <- pqn(metabolites)

TODO: explain TAN here: For each sample, every metabolite intensity value is vivided divided by the total sum of all metabolite intensity values measured in that sample ( values ignored by default), before multiplication by 100. The unit is %.

We have provided TAN normalization in a function called normalisze_tan(), which is analogical to the pqn function:

normalize_tan(data_matrix)
TASK 1

Normalize the metabolites data using TAN normalizatio and assign it to a new variable, e.g. metabolites_tan.

metabolites_tan <- normalize_tan(metabolites)

From here, we are going to use pqn normalized metabolites data for further analysis.

5.2 Scaling & transformation

TODO: explain transformation TODO: explain scaling and when and where

Let’s inspect how the distribution changes before and after transformation/scaling, based on the first metabolite in our data:

hist(metabolites_pqn[,1],
     main = "Original pqn normalized data",
     xlab = colnames(metabolites_pqn)[1])

TODO explain autoscaling here

Auto scaling is already implemented in the base R, as a scale() function.

TASK 1

Inspect the scale() function. A manual in the ‘Help’ tab of RStudio should open providing explanation and example usage of the function:

# execute the code ----->
?base::scale
TASK 2

Let’s auto scale the data. For this, we need to specify parameter’s values center=TRUE and scale=TRUE.

autoscaled_metabolites_pqn <- scale(metabolites_pqn,center = TRUE, scale = TRUE)
TASK 3

Let’s scale the data and plot the distribution of the first metabolite after auto scaling.

hist(autoscaled_metabolites_pqn[,1],
     main = "Auto scaling",
     xlab = colnames(metabolites_pqn)[1])

TODO: explaing pareto scaling

To compute pareto scaling, we can use scaling() function from the MetabolAnalyze package.

TASK 1

Inspect the scale() function. A manual in the ‘Help’ tab of RStudio should open providing explanation and example usage of the function:

# execute the code ----->
?MetabolAnalyze::scaling
TASK 2

Let’s scale the data using pareto scaling. We need to specify a value of an argument type.

pareto_metabolites_pqn <- scaling(metabolites_pqn, type = "pareto")
TASK 3

Let’s plot the distribution of the first metabolite after pareto scaling.

hist(pareto_metabolites_pqn[,1],
     main = "Pareto scaling",
     xlab = colnames(metabolites_pqn)[1])

We have already explored the concept of log transformation yesterday.

To log transform our data, we can simply use log(). By default, log10 is caluclated, but we can specify the base it in the base argument of the function:

log(x, base)
TASK 1

Let’s perfom log transformation on our data using 10 as a base. PS. Remember to add a pseudo-count to data.

log_metabolites_pqn <- log(metabolites_pqn +1)
TASK 2

Let’s plot the distribution of the first metabolite after log transformation.

hist(log_metabolites_pqn[,1],
     main = "Log transformation",
     xlab = colnames(metabolites_pqn)[1])

TODO: explain square roorthere

To square root transform our data, we can simply use sqrt().

TASK 1

Let’s perfom square root transformation on the data.

sqrt_metabolites_pqn <- sqrt(metabolites_pqn)
TASK 2

Let’s plot the distribution of the first metabolite after log transformation.

hist(sqrt_metabolites_pqn[,1],
     main = "Square root transformation",
     xlab = colnames(metabolites_pqn)[1])

6 Principal Component Analysis

We are going to run some PCA analysis on metabolites data and plot the results. It’s a great unsupervised tool to explore the data and identify any potential outliers, clusterings and structure in the data.

Let’s load the useful packages first, factoextra and plotly:

library(factoextra) # PCA functions
library(plotly) # interactive plots

6.1 PCA model

To perform PCA, we can use prcomp function, which has a folliwing structure:

prcomp(data, 
       center = TRUE, 
       scale = TRUE)

Note, that prcomp automatically performs autoscaling, however we can control this with parameters center and scale.

It’s important to autoscale the data before PCA, to give the equal importance to variables in the data, to avoid them “dragging” the decomposition towards some. variables.

TASK 1

Let’s compute our first PCA:

# write your code here
pca <- prcomp(metabolites_pqn,
       center = TRUE, 
       scale = TRUE)

Inspect the structure of the resulting model. Use str().

str(pca)
List of 5
 $ sdev    : num [1:241] 9.35 5.44 4.7 4.03 3.61 ...
 $ rotation: num [1:423, 1:241] -0.0148 -0.0519 -0.0343 0.0706 0.067 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:423] "V1" "V2" "V3" "V4" ...
  .. ..$ : chr [1:241] "PC1" "PC2" "PC3" "PC4" ...
 $ center  : Named num [1:423] 16824772 600613 172033 881174 64772837 ...
  ..- attr(*, "names")= chr [1:423] "V1" "V2" "V3" "V4" ...
 $ scale   : Named num [1:423] 22453073 1603145 227900 1123252 87005472 ...
  ..- attr(*, "names")= chr [1:423] "V1" "V2" "V3" "V4" ...
 $ x       : num [1:241, 1:241] 7.69 9.9 6.19 1.84 6.07 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:241] "CSM5FZ3T" "CSM5FZ44" "CSM5FZ48" "CSM5FZ4A" ...
  .. ..$ : chr [1:241] "PC1" "PC2" "PC3" "PC4" ...
 - attr(*, "class")= chr "prcomp"
  1. sdev - the standard deviations of the principal components

  2. rotation - variable loadings

  3. x - scores

6.2 Visualizing the results

We can visualize the percentage of explained varaince for each compount using plot.

plot(pca)

Let’s try to plot the score plot of the first PC against the second and visualize the Diagnosis! We also plot the explained variance of those component, by using get_eigenvalue() function on pca model.

Code
ggplot(pca$x, aes(x = PC1, 
                  y = PC2,
                  color = metadata_metabolites$Diagnosis)) +
  geom_point() +
  labs(x = paste0("PC1 (",round(get_eigenvalue(pca)[1,2],2),"%)"), 
       y =  paste0("PC2 (",round(get_eigenvalue(pca)[2,2],2),"%)")) +
  theme_bw()

Let’s look at the score plot, if we would skip autoscaling.

Code
pca_plain <- prcomp(metabolites_pqn, 
       center = FALSE, 
       scale = FALSE)
ggplot(pca_plain$x, aes(x = PC1, 
                        y = PC2,
                        color = metadata_metabolites$Diagnosis)) +
  geom_point() +
  labs(x = paste0("PC1 (",round(get_eigenvalue(pca_plain)[1,2],2),"%)"), 
       y =  paste0("PC2 (",round(get_eigenvalue(pca_plain)[2,2],2),"%)")) +
  theme_bw()

The larger variables pull the decomposition towards them, therefore we are not able to distinguish between variables.

6.3 Outliers

Notice on our scales PCA plot how some of the samples lay furhter away from the clump of points. Those points are potential outliers.

We can make an interactive plot and inspect which samples are outlying by pointing a cursor over points. For that, we can use ggplotly function from plotly package.

Code
p <- ggplot(pca$x, aes(x = PC1, 
                       y = PC2, 
                       z = rownames(pca$x),
                       color = metadata_metabolites$Diagnosis)) +
  geom_point() +
  theme_bw()

#ggplotly(tooltip = c("x", "y", "z", "color"))

Note that scale = TRUE cannot be used if there are zero or constant (for center = TRUE) variables.

plot() print()

TODO: later PCA (1: not scaled, 2: scaled + log ),

URF: normalized library(factoextra) library(randomForest) urf <- script_URF(nr_itteration = 100, real_data = metabolites_pqn, nr_trees = 1500, nr_samples = 8, class = metadata_metabolites$Diagnosis)

ggplot(as.data.frame(urf\(pc\)x), aes(x = PC1, y = PC2, z = rownames(metabolites_pqn))) + geom_point(aes(color = metadata_metabolites\(Diagnosis)) + labs(x = paste0("PC1 (",round(urf\)pr[1,2],2),“%)”), y = paste0(“PC2 (”,round(urf$pr[2,2],2),“%)”)) + theme_bw()

ASCA: synthetic time point (divide) .rs.restartR() # faktory: mlodzi vs adults

table(metadata_metabolites_small\(diagnosis[(which(metadata_metabolites_small\)consent_age <18))])

table(metadata_metabolites_small\(diagnosis[(which(metadata_metabolites_small\)consent_age >= 18 & metadata_metabolites_small$consent_age <= 60))])

7 Session info

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Sonoma 14.2.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Paris
 date     2024-06-28
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package        * version date (UTC) lib source
 bitops           1.0-7   2021-04-24 [1] CRAN (R 4.3.0)
 caTools          1.18.2  2021-03-28 [1] CRAN (R 4.3.0)
 cellranger       1.1.0   2016-07-27 [1] CRAN (R 4.3.0)
 cli              3.6.3   2024-06-21 [1] CRAN (R 4.3.3)
 colorspace       2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
 data.table       1.15.0  2024-01-30 [1] CRAN (R 4.3.1)
 digest           0.6.36  2024-06-23 [1] CRAN (R 4.3.3)
 dplyr          * 1.1.4   2023-11-17 [1] CRAN (R 4.3.1)
 ellipse        * 0.5.0   2023-07-20 [1] CRAN (R 4.3.0)
 evaluate         0.23    2023-11-01 [1] CRAN (R 4.3.1)
 factoextra     * 1.0.7   2020-04-01 [1] CRAN (R 4.3.0)
 fansi            1.0.6   2023-12-08 [1] CRAN (R 4.3.1)
 farver           2.1.2   2024-05-13 [1] CRAN (R 4.3.3)
 fastmap          1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 forcats        * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 generics         0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
 ggplot2        * 3.5.1   2024-04-23 [1] CRAN (R 4.3.1)
 ggrepel          0.9.5   2024-01-10 [1] CRAN (R 4.3.1)
 glue             1.7.0   2024-01-09 [1] CRAN (R 4.3.1)
 gplots         * 3.1.3.1 2024-02-02 [1] CRAN (R 4.3.1)
 gtable           0.3.5   2024-04-22 [1] CRAN (R 4.3.1)
 gtools         * 3.9.5   2023-11-20 [1] CRAN (R 4.3.1)
 here           * 1.0.1   2020-12-13 [1] CRAN (R 4.3.0)
 hms              1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
 htmltools        0.5.7   2023-11-03 [1] CRAN (R 4.3.1)
 htmlwidgets      1.6.4   2023-12-06 [1] CRAN (R 4.3.1)
 httr             1.4.7   2023-08-15 [1] CRAN (R 4.3.0)
 jsonlite         1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
 KernSmooth       2.23-22 2023-07-10 [1] CRAN (R 4.3.2)
 knitr            1.45    2023-10-30 [1] CRAN (R 4.3.1)
 labeling         0.4.3   2023-08-29 [1] CRAN (R 4.3.0)
 lazyeval         0.2.2   2019-03-15 [1] CRAN (R 4.3.0)
 lifecycle        1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
 lubridate      * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
 magrittr         2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 mclust         * 6.1     2024-02-23 [1] CRAN (R 4.3.1)
 MetabolAnalyze * 1.3.1   2019-08-31 [1] CRAN (R 4.3.0)
 munsell          0.5.1   2024-04-01 [1] CRAN (R 4.3.1)
 mvtnorm        * 1.2-4   2023-11-27 [1] CRAN (R 4.3.1)
 pillar           1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 plotly         * 4.10.3  2023-10-21 [1] CRAN (R 4.3.1)
 purrr          * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6               2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp             1.0.12  2024-01-09 [1] CRAN (R 4.3.1)
 readr          * 2.1.5   2024-01-10 [1] CRAN (R 4.3.1)
 readxl         * 1.4.3   2023-07-06 [1] CRAN (R 4.3.0)
 rlang            1.1.4   2024-06-04 [1] CRAN (R 4.3.3)
 rmarkdown        2.25    2023-09-18 [1] CRAN (R 4.3.1)
 rprojroot        2.0.4   2023-11-05 [1] CRAN (R 4.3.1)
 rstudioapi       0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
 scales           1.3.0   2023-11-28 [1] CRAN (R 4.3.1)
 sessioninfo      1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 stringi          1.8.4   2024-05-06 [1] CRAN (R 4.3.1)
 stringr        * 1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
 tibble         * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidyr          * 1.3.1   2024-01-24 [1] CRAN (R 4.3.1)
 tidyselect       1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse      * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 timechange       0.3.0   2024-01-18 [1] CRAN (R 4.3.1)
 tzdb             0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
 utf8             1.2.4   2023-10-22 [1] CRAN (R 4.3.1)
 vctrs            0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
 viridisLite      0.4.2   2023-05-02 [1] CRAN (R 4.3.0)
 withr            3.0.0   2024-01-16 [1] CRAN (R 4.3.1)
 xfun             0.42    2024-02-08 [1] CRAN (R 4.3.1)
 yaml             2.3.8   2023-12-11 [1] CRAN (R 4.3.1)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────